import pandas as pd
import numpy as np
import soundfile as sf
from musicntd.model.current_plot import *
import musicntd.data_manipulation as dm
import musicntd.autosimilarity_segmentation as as_seg
import musicntd.tensor_factory as tf
import musicntd.model.features as features
import musicntd.scripts.hide_code as hide
import nn_fac.ntd as NTD
This notebook aims at studying different representations of music, and at finding the most relevant one for our context (description of the signal for NTD).
Here, we will compare three time-frequency representations: the STFT, the CQT, and the chromagram (PCP).
This notebook is organised as follows: we first compute and plot these three representations, we then form a barwise tensor from each of them, decompose these tensors by NTD, and finally compare the segmentations obtained from the resulting autosimilarity matrices, both on one example song and on the whole RWC Pop dataset.
dataset_path = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\Entire RWC"
song_number = 1
song_path = dataset_path + "\\{}.wav".format(song_number)
# Choice of the annotation
annotations_type = "MIREX10"
annotations_folder = "C:\\Users\\amarmore\\Desktop\\Audio samples\\RWC Pop\\annotations\\{}\\".format(annotations_type)
annotation_path = annotations_folder + dm.get_annotation_name_from_song(song_number, annotations_type)
hop_length = 512
n_fft = hop_length * 4
the_signal, sampling_rate = sf.read(song_path)
hop_length_seconds = hop_length/sampling_rate
stft_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
feature="stft", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(stft_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Frequency (in indexes of bandwidths)")
In this example, low frequencies seem to dominate the spectrogram. This is related to acoustic properties of the human ear: at the same sound intensity, high frequencies are perceived as louder than low ones. Hence, to be perceived equally loud, low frequencies are usually played with a higher intensity than high frequencies.
cqt_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
feature="cqt", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(cqt_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Index of Constant-Q bandwidth")
The CQT seems to capture more mid-frequency information than the STFT, which, empirically, seems desirable.
Note that the 0 on the y-axis refers to the first Constant-Q bandwidth and not to a "note 0". Bandwidths are aligned with MIDI notes, and the first bandwidth (0) corresponds to the 24th MIDI note (C1). Hence, bandwidth indexes should be offset by 24 to be converted to the MIDI scale.
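For illustration, the conversion from a bandwidth index to a MIDI note (and to a frequency) amounts to the following small computation, following the offset of 24 described above:
# Convert a Constant-Q bandwidth index into its MIDI note and frequency.
bandwidth_index = 0                                 # first bandwidth of the CQT
midi_note = bandwidth_index + 24                    # offset of 24: index 0 -> MIDI 24 (C1)
frequency = 440 * 2 ** ((midi_note - 69) / 12)      # ~32.7 Hz (A4 = MIDI 69 = 440 Hz)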
pcp_spec = features.get_spectrogram(the_signal[:,0], sampling_rate,
feature="pcp", n_fft = n_fft, hop_length = hop_length)
plot_me_this_spectrogram(pcp_spec, invert_y_axis = False, x_axis = "Time (in number of frames)", y_axis = "Note index (in pitch-class, semi-tone spaced)")
The PCP (pitch-class profiles, or chromagram) represents the harmonic content of the song, and discards most of the percussive content.
In this implementation, the PCP is computed from the CQT of the signal.
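For illustration, a comparable chromagram could be obtained directly with librosa (a sketch, which is not necessarily the exact call made inside features.get_spectrogram):
import librosa
# Chromagram (12 pitch classes) computed from the CQT of the left channel.
chroma = librosa.feature.chroma_cqt(y=the_signal[:, 0], sr=sampling_rate, hop_length=hop_length)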
Now, we will compute NTD on these three examples, and compare the resulting segmentations.
Firstly, we will load the annotation.
# Loading and formatting annotations
annotations = dm.get_segmentation_from_txt(annotation_path, annotations_type)
references_segments = np.array(annotations)[:,0:2]
In order to perform NTD on these examples, we need to transform the spectrograms into "Time-Frequency-Bar" tensors, i.e. tensors where time is divided into two scales: the bar scale (each slice of the tensor corresponds to one bar), and the inner-bar scale (the frames inside each bar).
(See the article "Uncovering Audio Patterns in Music with Nonnegative Tucker Decomposition For Structural Segmentation", soon to be published at the time of writing, for a detailed explanation of this tensor.)
In order to construct our tensor, we first need to estimate the downbeats. This is done with the madmom toolbox [4].
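As an illustration, a downbeat estimation with madmom could look like the sketch below (the actual call is wrapped in dm.get_bars_from_audio; the processor parameters used here are illustrative assumptions):
from madmom.features.downbeats import RNNDownBeatProcessor, DBNDownBeatTrackingProcessor
# Beat/downbeat activations from the RNN, decoded by a DBN tracker.
activations = RNNDownBeatProcessor()(song_path)
tracker = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
beats = tracker(activations)                        # rows of (time in seconds, beat position)
downbeat_times = beats[beats[:, 1] == 1][:, 0]      # beat position 1 marks a downbeat
estimated_bars = list(zip(downbeat_times[:-1], downbeat_times[1:]))  # (start, end) of each bar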
# Estimate the downbeats/bar frontiers
bars = dm.get_bars_from_audio(song_path)
# Convert the annotation in the bar scale: this is used for plotting the annotation on our figures.
annotations_frontiers_barwise = dm.frontiers_from_time_to_bar(np.array(annotations)[:,1], bars)
We then cut each spectrogram at every downbeat, in order to obtain a collection of barwise spectrograms, one per bar.
This collection of spectrograms then forms the tensor. One difficulty is that bars can have different durations, resulting in barwise spectrograms of different sizes, whereas a tensor needs all of its slices (i.e. the individual barwise spectrograms) to be of the same size.
In these experiments, we decided to find the longest bar, use its length (in number of frames) as the common dimension for all bars, and zero-pad the shorter ones, as in [2].
This method was then changed, see the 3rd notebook for details.
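As an illustration of this padding strategy, here is a minimal sketch of such a barwise tensorization (assuming bars is a list of (start, end) times in seconds; the actual implementation is tf.tensorize_barwise):
def barwise_tensor_sketch(spectrogram, bars, hop_length_seconds):
    # Convert the bar boundaries (in seconds) into frame indexes.
    frame_bars = [(round(start / hop_length_seconds), round(end / hop_length_seconds))
                  for (start, end) in bars]
    longest_bar = max(end - start for (start, end) in frame_bars)
    slices = []
    for (start, end) in frame_bars:
        bar_spectrogram = spectrogram[:, start:end]
        # Zero-pad shorter bars up to the length of the longest one.
        padding = longest_bar - bar_spectrogram.shape[1]
        slices.append(np.pad(bar_spectrogram, ((0, 0), (0, padding))))
    # Stack the barwise spectrograms into a (frequency, time in bar, bar) tensor.
    return np.stack(slices, axis=-1)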
stft_tensor_spectrogram = tf.tensorize_barwise(stft_spec, bars, hop_length_seconds)
cqt_tensor_spectrogram = tf.tensorize_barwise(cqt_spec, bars, hop_length_seconds)
pcp_tensor_spectrogram = tf.tensorize_barwise(pcp_spec, bars, hop_length_seconds)
# One particular slice of the tensors, representing a particular bar (48-th bar exactly).
idx = 48
hide.slice_tensor_spectrogram(idx, stft_tensor_spectrogram,cqt_tensor_spectrogram,pcp_tensor_spectrogram)
NB: note that these bars contain blank frames at the end. This is because, at the time of these experiments, bars were not fitted to hold the same number of frames; instead, every bar was given the size of the largest one, and shorter bars were padded with zero frames, as in [2].
# Rank selection
ranks = [32,32,32]
Now, we decompose these tensors by NTD.
Firstly, we will plot the factor matrices resulting from this decomposition, and secondly, we will plot the autosimilarity of the $Q$ (and normalized $Q$) matrix, along with the autosimilarity of the signal itself, for comparison.
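As a reminder, the autosimilarity of a factor is the matrix of pairwise similarities between its rows (one row per bar for $Q$). A minimal sketch of such a computation is given below (the actual one is handled by musicntd.autosimilarity_segmentation; using cosine similarity for the normalized version is an assumption of this sketch):
def autosimilarity_sketch(Q, normalize=True):
    # Q has one row per bar; the autosimilarity contains the dot products between bars.
    if normalize:
        # Normalizing each row turns these dot products into cosine similarities.
        norms = np.linalg.norm(Q, axis=1, keepdims=True)
        Q = Q / np.where(norms == 0, 1, norms)
    return Q @ Q.T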
stft_core, stft_factors = NTD.ntd(stft_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
hide.nice_plot_factors(stft_factors)
hide.nice_plot_autosimilarities(stft_factors[2], stft_tensor_spectrogram, annotations_frontiers_barwise)
cqt_core, cqt_factors = NTD.ntd(cqt_tensor_spectrogram, ranks = ranks, init = "tucker", verbose = False, hals = False,
sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
hide.nice_plot_factors(cqt_factors)
hide.nice_plot_autosimilarities(cqt_factors[2], cqt_tensor_spectrogram, annotations_frontiers_barwise)
NB: the chroma dimension being 12, the rank of $W$ should not exceed 12, as this would probably create redundancy in the factors and be counter-productive to a salient decomposition. It is therefore fixed to 12.
pcp_core, pcp_factors = NTD.ntd(pcp_tensor_spectrogram, ranks = [12, 32, 32], init = "tucker", verbose = False, hals = False,
sparsity_coefficients = [None, None, None, None], normalize = [True, True, False, True])
hide.nice_plot_factors(pcp_factors)
hide.nice_plot_autosimilarities(pcp_factors[2], pcp_tensor_spectrogram, annotations_frontiers_barwise)
In order to compare these three decompositions, we will study the segmentations computed from their autosimilarities.
In this context, we developed two segmentation methods, both based on the principle of a kernel sliding along the diagonal of the autosimilarity:
- the novelty computation,
- the convolution computation.
Firstly, the novelty computation.
This is the method developed in [1].
The idea is to apply a kernel containing only 1 and -1 values, arranged as follows: $\left[ \begin{matrix} 1 & 1 & -1 & -1\\ 1 & 1 & -1 & -1\\ -1 & -1 & 1 & 1 \\ -1 & -1 & 1 & 1 \end{matrix} \right]$ (of size 4 here, but of size 16 in further tests).
For each bar of the song, we center the kernel on this bar's index on the diagonal of the autosimilarity matrix, crop the autosimilarity to the size of the kernel, and sum all the values obtained by the element-wise product between these two equal-size matrices.
Positive terms represent the inner similarity of the near past and of the near future of this bar, taken separately. A high score indicates that the past is similar to itself, and so is the future.
Negative terms represent the cross-similarity between these two zones, i.e. the similarity between the near past and the near future of this bar. If this term is high, past and future are similar to each other, and the resulting novelty score will be low, as positive and negative terms cancel each other out. Conversely, if this term is low, the near past and the near future are dissimilar, and the novelty score will be high. Hence, this bar is probably a frontier between two quite different zones.
From this novelty score, computed on every bar of the song, frontiers are then selected as follows: we detect the peaks of the novelty measure, and keep only those exceeding a threshold set relative to the highest novelty value.
Hence, frontiers are the bar indexes of the thresholded novelty peaks.
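A minimal sketch of this novelty computation and of the peak-picking is given below (the autosimilarity is zero-padded so that the kernel can be centered on every bar; the 0.22 threshold is an arbitrary illustrative value):
from scipy.signal import find_peaks

def novelty_sketch(autosimilarity, kernel_size=16):
    # Checkerboard kernel: +1 on the diagonal blocks, -1 on the off-diagonal blocks.
    half = kernel_size // 2
    block = np.ones((half, half))
    kernel = np.block([[block, -block], [-block, block]])
    nb_bars = autosimilarity.shape[0]
    padded = np.pad(autosimilarity, half)           # zero-pad so the kernel fits at the borders
    novelty = np.zeros(nb_bars)
    for bar in range(nb_bars):
        # Center the kernel on this bar, crop the autosimilarity, sum the element-wise product.
        window = padded[bar:bar + kernel_size, bar:bar + kernel_size]
        novelty[bar] = np.sum(window * kernel)
    return novelty

novelty = novelty_sketch(pcp_factors[2] @ pcp_factors[2].T)
frontiers, _ = find_peaks(novelty, height=0.22 * np.max(novelty))  # peaks above a relative threshold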
Below are shown visualizations of this novelty measure (in black), along with the estimated frontiers, shown as vertical bars:
hide.nice_plot_novelty_measures(stft_factors[2], cqt_factors[2], pcp_factors[2], annotations_frontiers_barwise)
From these frontiers, we compute some evaluation scores (the column names in the table are in French):
hide.print_dataframe_results_novelty(stft_factors[2], cqt_factors[2], pcp_factors[2], bars, references_segments)
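For reference, such boundary-retrieval scores (precision, recall and F-measure within a given time tolerance) can for instance be computed with the mir_eval toolbox; a sketch, where estimated_segments is a hypothetical array of segment boundaries:
import mir_eval
# estimated_segments: hypothetical (start, end) times, in seconds, of the estimated segments.
# Hit-rate scores with a 0.5-second tolerance window around each annotated frontier.
precision, recall, f_measure = mir_eval.segment.detection(references_segments, estimated_segments, window=0.5)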
The second segmentation algorithm is the convolution computation.
The main idea is that the darker the zone we analyze, the more similar its bars are to one another, and hence the more likely this zone is to belong to a single segment.
To this end, we want to evaluate the quantity of dark areas outside of the diagonal. Indeed, the diagonal is always highly similar (as it represents the similarity of each bar with itself), and would dominate the other similarity values.
The associated convolution cost is obtained with a kernel of this shape: $\left[ \begin{matrix} 0 & 1 & 1 & 1\\ 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{matrix} \right]$ (of size 4 here).
Hence, the convolution cost of a segment ($b_1$, $b_2$) is the sum of the element-wise product between this kernel and the autosimilarity of this segment, i.e. the autosimilarity centered on the diagonal and restricted to the indexes $(i,j) \in [b_1, b_2]^2$. The kernel and the cropped autosimilarity must be of the same size.
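A minimal sketch of this cost is given below (without the normalization by the segment size that a real implementation would likely need in order to compare segments of different lengths):
def convolution_cost_sketch(autosimilarity, b1, b2):
    # Crop the autosimilarity to the segment [b1, b2) (in bar indexes).
    segment_autosimilarity = autosimilarity[b1:b2, b1:b2]
    # Kernel full of ones, with zeros on the diagonal, of the size of the segment.
    kernel = np.ones(segment_autosimilarity.shape) - np.identity(b2 - b1)
    # Sum of the element-wise product between the kernel and the cropped autosimilarity.
    return np.sum(kernel * segment_autosimilarity)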
Starting from this cost, we aim at finding the sequence of segments maximizing the sum of the local convolution costs.
To this end, we use a dynamic programming algorithm (inspired by [3]) which maximizes the global sum of the costs over a sequence of segments.
This algorithm returns the sequence of segments which maximizes the sum of all segment costs, among all possible segmentations of the song.
NB: more details about this algorithm can be found in the Notebook "Appendix - Focus on the segmentation algorithm".
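A minimal sketch of such a dynamic programming scheme is given below (re-using the convolution cost sketched above; the maximal segment size of 32 bars is an illustrative assumption):
def dynamic_programming_sketch(autosimilarity, max_size=32):
    # best_cost[i]: best cumulated cost for segmenting the bars [0, i);
    # best_cut[i]: start of the last segment in that best segmentation.
    nb_bars = autosimilarity.shape[0]
    best_cost = np.full(nb_bars + 1, -np.inf)
    best_cost[0] = 0
    best_cut = np.zeros(nb_bars + 1, dtype=int)
    for end in range(1, nb_bars + 1):
        for start in range(max(0, end - max_size), end):
            cost = best_cost[start] + convolution_cost_sketch(autosimilarity, start, end)
            if cost > best_cost[end]:
                best_cost[end] = cost
                best_cut[end] = start
    # Backtrack from the last bar to recover the frontiers (in bar indexes).
    frontiers, index = [nb_bars], nb_bars
    while index > 0:
        index = best_cut[index]
        frontiers.append(index)
    return sorted(frontiers)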
Below are shown the autosimilarities of the given song, along with frontiers estimated by the convolution algorithm:
hide.nice_plot_convolution_measures(stft_factors[2], cqt_factors[2], pcp_factors[2], annotations_frontiers_barwise)
From these frontiers, we compute some evaluation scores (the column names in the table are in French):
hide.print_dataframe_results_convolution_features(stft_factors[2], cqt_factors[2], pcp_factors[2], bars, references_segments)
Finally, we computed these scores on all 100 songs of the RWC Pop dataset, with the convolution algorithm.
Ranks were all set to 32 (except for $W$ in the PCP case, where the uncompressed dimension is 12, so the rank was set to 12, as above).
From these frontiers, we compute the same evaluation scores as before.
hide.run_and_show_results_on_rwc()
In conclusion, based on these results, we decided to focus on the chromagram (PCP) representation in the remainder of this work.
[1] Foote, J. (2000, July). Automatic audio segmentation using a measure of audio novelty. In 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No. 00TH8532) (Vol. 1, pp. 452-455). IEEE.
[2] Smith, J. B., & Goto, M. (2018, April). Nonnegative tensor factorization for source separation of loops in audio. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 171-175). IEEE.
[3] Sargent, G., Bimbot, F., & Vincent, E. (2016). Estimating the structural segmentation of popular music pieces under regularity constraints. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(2), 344-358.
[4] Böck, S., Korzeniowski, F., Schlüter, J., Krebs, F., & Widmer, G. (2016, October). Madmom: A new python audio and music signal processing library. In Proceedings of the 24th ACM international conference on Multimedia (pp. 1174-1178).